The purpose of this document is to serve as an example of basic coding and analysis ability. R is my tool of choice for “telling a story with data”, and the R Markdown format allows direct translation of a script to a more readable format.
The original goal of this project was to devise a systematic approach to learning about the playable heroes in the competitive online game DOTA 2, less formally referred to as Dota (What’s a Dota?). Heroes are characterized in part by a number of statistics (how much damage they do, how much they can take, etc.), which are partially available in tabular format from the Dota Wiki. With this data in hand, it was also a chance to apply k-means clustering to a novel task where the natural groupings were not known in advance, yet would have a meaningful interpretation in the desired context.
The case for application of k-means clustering is straightforward:
*For a sense of the challenge, OpenAI managed to train their model to reasonable performance on only 17 heroes after 10 months (and 45,000 simulated years) of Dota practice. 25 heroes was a stretch goal that they fell short of.
Using the rvest package, the table of data can be pulled directly from a reference to the URL and element path. Since Dota is frequently updated, often changing these base values, it was preferable for the process to be as automatic as possible. The source table on the wiki can be found here for reference.
Most of the data is collected in an R-readable format, which is one advantage of pulling directly from a webpage. The first adjustment was to rename the columns to slightly more consistent and R-friendly labels. However, the second column, for a hero’s Attribute, did not come through at all; it contained icons rather than text, which R could not handle.
Unfortunately there is no online table for the Attribute data, so for the sake of keeping this document self-contained, the information is hardcoded as a lookup. At this point, all of the raw stats from the game have been captured in an easy-to-access format. I made use of the package DT to generate an interactive table in order to explore the complete data set, as included below.
A cursory glance suggests that Agility, Strength, and Intelligence heroes share some stat similarities. However, these archetypes are not firm rules, and these stats are not sufficient to describe in-game performance. Base stats do not reveal directly which heroes can cause/survive the most/least damage, which is the basic model for combat in the game. The goal is to determine which heroes have similar in-game characteristics, so the data requires further processing to become meaningful.
library(rvest) # read_html(), html_node(), html_table()
url <- 'https://dota2.gamepedia.com/Table_of_hero_attributes'
tbl <- '//*[@id="mw-content-text"]/div/div[2]/table' # found by 'inspect element' in a Chromium-based browser
dat <- html_table(html_node(x=read_html(url), xpath=tbl))

| HERO | A | STR | STR+ | STR25 | AGI | AGI+ | AGI25 | INT | INT+ | INT25 | T | T+ | T25 | MS | AR | DMG(MIN) | DMG(MAX) | RG | BAT | ATKPT | ATKBS | VS-D | VS-N | TR | COL | HP/S | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abaddon | NA | 23 | 3.0 | 95.0 | 23 | 1.5 | 59.0 | 21 | 2.0 | 69.0 | 67 | 6.5 | 223.0 | 325 | 1.68 | 55 | 65 | 150 | 1.7 | 0.56 | 0.41 | 1800 | 800 | 0.5 | 24 | 0.00 | 2 |
| Alchemist | NA | 25 | 2.4 | 82.6 | 22 | 1.2 | 50.8 | 25 | 1.8 | 68.2 | 72 | 5.4 | 201.6 | 305 | 2.52 | 49 | 58 | 150 | 1.7 | 0.35 | 0.65 | 1800 | 800 | 0.6 | 24 | 0.00 | 2 |
| Ancient Apparition | NA | 20 | 1.9 | 65.6 | 20 | 2.2 | 72.8 | 23 | 3.4 | 104.6 | 63 | 7.5 | 243.0 | 290 | 2.20 | 44 | 54 | 675 | 1.7 | 0.45 | 0.30 | 1800 | 800 | 0.6 | 24 | 0.00 | 2 |
| Anti-Mage | NA | 23 | 1.3 | 54.2 | 24 | 3.2 | 100.8 | 12 | 1.8 | 55.2 | 59 | 6.3 | 210.2 | 310 | 2.84 | 53 | 57 | 150 | 1.4 | 0.30 | 0.60 | 1800 | 800 | 0.5 | 24 | 0.25 | 2 |
| Arc Warden | NA | 25 | 3.0 | 97.0 | 15 | 2.4 | 72.6 | 24 | 2.6 | 86.4 | 64 | 8.0 | 256.0 | 280 | 0.40 | 46 | 56 | 625 | 1.7 | 0.30 | 0.70 | 1800 | 800 | 0.6 | 24 | 0.25 | 2 |
| Axe | NA | 25 | 3.4 | 106.6 | 20 | 2.2 | 72.8 | 18 | 1.6 | 56.4 | 63 | 7.2 | 235.8 | 295 | 1.20 | 49 | 53 | 150 | 1.7 | 0.50 | 0.50 | 1800 | 800 | 0.6 | 24 | 2.75 | 2 |
# Rename some columns to avoid using reserved characters in column names
colnames(dat) <- c('HERO','A','STR','STR.inc','STR25','AGI','AGI.inc','AGI25','INT','INT.inc','INT25',
                   'TOTAL','TOTAL.inc','TOTAL25','MS','AR','DMG.MIN','DMG.MAX','RG','BAT','ATK.PT','ATK.BS',
                   'VS.D','VS.N','TR','COL','HP.S','L')
# Note that column 'A' contains only NAs
# Each hero has their primary Attribute as one of Agility, Intelligence, or Strength
a.agi = c('Anti-Mage', 'Arc Warden', 'Bloodseeker', 'Bounty Hunter', 'Broodmother', 'Clinkz', 'Drow Ranger', 'Ember Spirit',
'Faceless Void', 'Gyrocopter', 'Juggernaut', 'Lone Druid', 'Luna', 'Medusa', 'Meepo', 'Mirana', 'Monkey King',
'Morphling', 'Naga Siren', 'Nyx Assassin', 'Pangolier', 'Phantom Assassin', 'Phantom Lancer', 'Razor', 'Riki',
'Shadow Fiend', 'Slark', 'Sniper', 'Spectre', 'Templar Assassin', 'Terrorblade', 'Troll Warlord', 'Ursa',
'Vengeful Spirit', 'Venomancer', 'Viper', 'Weaver')
a.int = c("Ancient Apparition", "Bane", "Batrider", "Chen", "Crystal Maiden", "Dark Seer", "Dark Willow", "Dazzle",
"Death Prophet", "Disruptor", "Enchantress", "Enigma", "Grimstroke", "Invoker", "Jakiro", "Keeper of the Light",
"Leshrac", "Lich", "Lina", "Lion", "Nature's Prophet", "Necrophos", "Ogre Magi", "Oracle", "Outworld Devourer",
"Puck", "Pugna", "Queen of Pain", "Rubick", "Shadow Demon", "Shadow Shaman", "Silencer", "Skywrath Mage",
"Storm Spirit", "Techies", "Tinker", "Visage", "Warlock", "Windranger", "Winter Wyvern", "Witch Doctor", "Zeus")
a.str = c('Abaddon', 'Alchemist', 'Axe', 'Beastmaster', 'Brewmaster', 'Bristleback', 'Centaur Warrunner', 'Chaos Knight',
'Clockwerk', 'Doom', 'Dragon Knight', 'Earth Spirit', 'Earthshaker', 'Elder Titan', 'Huskar', 'Io', 'Kunkka',
'Legion Commander', 'Lifestealer', 'Lycan', 'Magnus', 'Mars', 'Night Stalker', 'Omniknight', 'Phoenix', 'Pudge',
'Sand King', 'Slardar', 'Spirit Breaker', 'Sven', 'Tidehunter', 'Timbersaw', 'Tiny', 'Treant Protector', 'Tusk',
'Underlord', 'Undying', 'Wraith King')
dat$A[dat$HERO %in% a.agi] = 'Agi'
dat$A[dat$HERO %in% a.int] = 'Int'
dat$A[dat$HERO %in% a.str] = 'Str'
head(dat)

| HERO | A | STR | STR.inc | STR25 | AGI | AGI.inc | AGI25 | INT | INT.inc | INT25 | TOTAL | TOTAL.inc | TOTAL25 | MS | AR | DMG.MIN | DMG.MAX | RG | BAT | ATK.PT | ATK.BS | VS.D | VS.N | TR | COL | HP.S | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abaddon | Str | 23 | 3.0 | 95.0 | 23 | 1.5 | 59.0 | 21 | 2.0 | 69.0 | 67 | 6.5 | 223.0 | 325 | 1.68 | 55 | 65 | 150 | 1.7 | 0.56 | 0.41 | 1800 | 800 | 0.5 | 24 | 0.00 | 2 |
| Alchemist | Str | 25 | 2.4 | 82.6 | 22 | 1.2 | 50.8 | 25 | 1.8 | 68.2 | 72 | 5.4 | 201.6 | 305 | 2.52 | 49 | 58 | 150 | 1.7 | 0.35 | 0.65 | 1800 | 800 | 0.6 | 24 | 0.00 | 2 |
| Ancient Apparition | Int | 20 | 1.9 | 65.6 | 20 | 2.2 | 72.8 | 23 | 3.4 | 104.6 | 63 | 7.5 | 243.0 | 290 | 2.20 | 44 | 54 | 675 | 1.7 | 0.45 | 0.30 | 1800 | 800 | 0.6 | 24 | 0.00 | 2 |
| Anti-Mage | Agi | 23 | 1.3 | 54.2 | 24 | 3.2 | 100.8 | 12 | 1.8 | 55.2 | 59 | 6.3 | 210.2 | 310 | 2.84 | 53 | 57 | 150 | 1.4 | 0.30 | 0.60 | 1800 | 800 | 0.5 | 24 | 0.25 | 2 |
| Arc Warden | Agi | 25 | 3.0 | 97.0 | 15 | 2.4 | 72.6 | 24 | 2.6 | 86.4 | 64 | 8.0 | 256.0 | 280 | 0.40 | 46 | 56 | 625 | 1.7 | 0.30 | 0.70 | 1800 | 800 | 0.6 | 24 | 0.25 | 2 |
| Axe | Str | 25 | 3.4 | 106.6 | 20 | 2.2 | 72.8 | 18 | 1.6 | 56.4 | 63 | 7.2 | 235.8 | 295 | 1.20 | 49 | 53 | 150 | 1.7 | 0.50 | 0.50 | 1800 | 800 | 0.6 | 24 | 2.75 | 2 |
Data can be searched, sorted, and filtered using the GUI rather than R commands.
EXAMPLE: Sorting by Strength at level 25 (col 5), we can see the Attribute (col 2) for 23 of the top 25 heroes is Strength. When sorting by level 1 Strength (col 3), only 15 of the top 25 are Strength heroes. Clearly, Attribute alone is not a sufficient guide to what stats a hero will have at every point in the game.
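The same kind of count can be made in code rather than in the GUI. This is a minimal sketch, assuming the column names used in this document (`A`, `STR`, `STR25`). The helper `top.n.share` is a hypothetical name, and the six-row data frame is copied from the first rows of the table above purely for illustration.

```r
# Six rows copied from the table above; the full data set would be used in practice
toy <- data.frame(
  HERO  = c('Abaddon', 'Alchemist', 'Ancient Apparition', 'Anti-Mage', 'Arc Warden', 'Axe'),
  A     = c('Str', 'Str', 'Int', 'Agi', 'Agi', 'Str'),
  STR   = c(23, 25, 20, 23, 25, 25),
  STR25 = c(95.0, 82.6, 65.6, 54.2, 97.0, 106.6)
)

# Hypothetical helper: count heroes of each Attribute among the top n rows by one stat
top.n.share <- function(d, stat, n) {
  top <- d[order(d[[stat]], decreasing = TRUE), ][seq_len(min(n, nrow(d))), ]
  table(top$A)
}

top.n.share(toy, 'STR25', 3)  # Attribute mix of the three highest level-25 Strength values
top.n.share(toy, 'STR', 3)    # the same question at level 1
```

With the full table, `n = 25` would give the counts quoted in the example.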
For the desired comparisons, some in-game characteristics of the heroes will need to be calculated according to the convoluted mechanics of Dota, a guide to which can be found on the Mechanics page on the Dota Wiki. data.table syntax is used mostly as a personal preference over base R for condensing some of the more complex calculations, in the pursuit of more readable code.
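As one concrete instance of these mechanics, effective HP combines a hero's health with the damage reduction granted by armour. The sketch below uses the same constants as the `derive.attributes` code in this document; they reflect the patch analysed here and may differ in other versions of the game. The function names `armour.reduction` and `effective.hp` are illustrative only.

```r
# Fraction of physical damage blocked for a given armour value
armour.reduction <- function(armour) (0.052 * armour) / (0.9 + 0.048 * armour)

# Health a hero effectively has against physical attacks
effective.hp <- function(hp, armour) hp / (1 - armour.reduction(armour))

# Axe at level 1, from the table above: 25 Strength, 1.20 armour
hp.axe <- 200 + 20 * 25        # 700 health
effective.hp(hp.axe, 1.20)     # a little under 750
```

Because the reduction grows with armour, two heroes with identical health can differ noticeably in how much damage they actually absorb.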
Every hero starts each game of Dota at level 1 with their base stats. Heroes level up to a maximum of 25, with their stats being increased at a constant rate each level. Rather than calculate statistics for every level, for the purpose of general comparison only a few key levels were selected/calculated in order to summarize the strengths and weaknesses of each hero at significant points in the game.
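The constant per-level growth can be written as a one-line function. This is a small sketch under that assumption; `stat.at.level` is a hypothetical helper name, and the numbers are Abaddon's Strength values from the table above.

```r
# Stat value at a given level, assuming a constant gain for each level above level 1
stat.at.level <- function(base, gain, level) base + (level - 1) * gain

# Abaddon's Strength: base 23, gain 3.0 per level
stat.at.level(23, 3.0, 25)            # 95, matching the STR25 column
stat.at.level(23, 3.0, c(1, 10, 20))  # the key levels used for comparison
```

This is also why the multipliers 9 and 19 appear in the calculations for levels 10 and 20: a hero at level 10 has gained nine increments over the level 1 base.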
At this point, the data required to compare and cluster heroes based on combat ability are ready. Even before clustering, the data allow manual comparisons between a few heroes at a time, which can be done interactively using the plotly package.
library(data.table) # setDT() and the dat[, expr] syntax used below
derive.attributes <- function(dat.sample){
# input is a single data frame with all named hero attributes, for any selected subset of heroes
# outputs the calculated hero stats as a data frame to be used for cluster comparisons
setDT(dat.sample)
AGI.HERO <- dat.sample$A=='Agi'
INT.HERO <- dat.sample$A=='Int'
STR.HERO <- dat.sample$A=='Str'
# Attack damage is random over a certain range for each hero
baseDamage <- dat.sample[,(DMG.MIN+DMG.MAX)/2]
# Attack damage increases each level with the stat that matches the primary attribute of the hero
damage.inc <- rowSums(matrix(data=c(dat.sample$AGI.inc*AGI.HERO,
dat.sample$INT.inc*INT.HERO,
dat.sample$STR.inc*STR.HERO),
ncol=3))
# Attack speed (rate) is a function of Agility and Base Attack Time
baseAS <- dat.sample[,(1 + AGI/100) / BAT]
AS10 <- dat.sample[,(1 + (AGI + 9*AGI.inc)/100) / BAT]
AS20 <- dat.sample[,(1 + (AGI + 19*AGI.inc)/100) / BAT]
# Mana for using abilities is determined by Intelligence
baseMana <- 75 + 12 * dat.sample$INT
Mana.inc <- 12 * dat.sample$INT.inc
baseMana.regen <- 1 + 0.05 * dat.sample$INT
Mana.regen.inc <- 0.05 * dat.sample$INT.inc
# Health Points are dependent on Strength
baseHP <- 200 + 20 * dat.sample$STR
HP.inc <- 18 * dat.sample$STR.inc
baseHP.regen <- dat.sample[,HP.S + 0.09 * STR]
HP.regen.inc <- 0.09 * dat.sample$STR.inc
armour10 <- dat.sample[,AR + 9*AGI.inc * 0.16]
armour20 <- dat.sample[,AR + 19*AGI.inc * 0.16]
MvSpd25=dat.sample[,MS * (1 + AGI25*0.0007)]
Spell.Amp25=dat.sample$INT25 * (0.0005)
Magic.Resist25=dat.sample$STR25 * (0.0008)
# final calculated characteristics of interest
Dmg=baseDamage; Dmg5=baseDamage+4*damage.inc; Dmg10=baseDamage+9*damage.inc
DPS=baseDamage*baseAS; DPS10=Dmg10*AS10; DPS20=(baseDamage+19*damage.inc)*AS20;
Mana=baseMana;
Mana10=baseMana+9*Mana.inc;
Mana20=baseMana + 19*Mana.inc;
Mana.Regen=baseMana.regen;
Mana.Regen10=baseMana.regen+9*Mana.regen.inc;
Mana.Regen20=baseMana.regen+19*Mana.regen.inc;
Arm=dat.sample[,(0.052*AR)/(0.9 + 0.048*AR)];
Arm10=(0.052*armour10)/(0.9 + 0.048*armour10);
Arm20=(0.052*armour20)/(0.9 + 0.048*armour20);
HP=baseHP;
HP10=baseHP+9*HP.inc;
HP20=baseHP+19*HP.inc;
HP.Regen=baseHP.regen;
HP.Regen10=baseHP.regen+9*HP.regen.inc;
HP.Regen20=baseHP.regen+19*HP.regen.inc;
EffHP=HP/(1-Arm);
EffHP10=HP10/(1-Arm10);
EffHP20=HP20/(1-Arm20);
result <-data.frame(Hero=dat.sample$HERO, Attr=substr(dat.sample$A,1,3), #2
Range=dat.sample$RG, MvSpd=dat.sample$MS, MvSpd25, #5
Dmg, DPS, Dmg5, Dmg10, DPS10, DPS20, #11
Mana, Mana10, Mana20, #14
Mana.Regen, Mana.Regen10, Mana.Regen20, Spell.Amp25, #18
Arm, Arm10, Arm20, #21
HP, EffHP, HP10, EffHP10, HP20, EffHP20, #27
HP.Regen, HP.Regen10, HP.Regen20, Magic.Resist25, #31
VisionD=dat.sample$VS.D, VisionN=dat.sample$VS.N) #33
# columns for number formatting
whole <- c(3,4,5,12,13,14,22,23,24,25,26,27,32,33)
tenth <- c(6,7,8,9,10,11,15,16,17,28,29,30) # column 20 (Arm10) is a percentage and is rounded with prcnt
prcnt <- c(18,19,20,21,31)
result[,whole] <- round(result[,whole], digits=0);
result[,tenth] <- round(result[,tenth], digits=1);
result[,prcnt] <- round(result[,prcnt], digits=2)
return(result)
}
sample = TRUE # default/all
# can be any subset of heroes for testing
derived <- derive.attributes(dat[sample])

Data can be searched, sorted, and filtered using the GUI rather than R commands.
EXAMPLE: Sorting by level 1 damage (Dmg), it can be noted that 17 of the top 25 heroes are Strength-attribute heroes; yet when we sort by level 20 damage-per-second (DPS20), 17 of the leading 25 are Agility-attribute heroes. Offensive ability in early levels does not reliably predict how capable a hero will be at higher levels, illustrating the need to classify heroes on several dimensions for a more complete understanding.
These archetypes can be useful as a general guideline, but this comparison also reveals the significant number of heroes that go against the trend for their attribute, which makes it difficult to learn which heroes can be expected to have which capabilities.
For a less cluttered visual display, 17/33 variables are displayed for relative comparisons. These measures are enough to describe the hero in terms related to gameplay.
EXAMPLE: When highlighting the hero Anti-Mage, it is easy to see at a glance that his Damage-Per-Second ratings are well above average, while Mana and HP-related measures fall noticeably short. This translates to a strategy of targeting important enemies and objectives, while trying to avoid being caught himself.
Adding Chaos Knight to the plot reveals the hero is similar in having above-average damage and low Mana, but the difference in HP/Effective HP makes him capable of absorbing much more damage. In a one-on-one fight of basic attacks, Chaos Knight would be expected to win if Anti-Mage does not run away.
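The "centred and standardized" values in the plot are z-scores: each variable has its mean subtracted and is divided by its standard deviation, so that stats on very different scales share one axis. A small sketch of what `scale()` does, using the movement speed and level 1 Strength of the six heroes shown in the table above as stand-in data:

```r
# Movement speed and level-1 Strength for the six heroes shown in the table above
stats <- data.frame(MS  = c(325, 305, 290, 310, 280, 295),
                    STR = c(23, 25, 20, 23, 25, 25))

z <- scale(stats)   # subtract each column mean, divide by each column standard deviation

# The same result computed by hand for one column
z.manual <- (stats$MS - mean(stats$MS)) / sd(stats$MS)

colMeans(z)         # approximately 0 for every column
apply(z, 2, sd)     # 1 for every column
```

A value of 1 on the plot therefore means one standard deviation above the average hero for that variable.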
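library(plotly)   # plot_ly(), layout() and the %>% pipe
library(reshape2) # melt(); assumed, since the source does not state which package supplied it
# for the visual comparison, 17 of the 33 variables are represented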
group.heroes=melt(data.frame(Hero=derived$Hero,
scale(x=derived[c("Range","MvSpd","Dmg","DPS","Dmg5","Dmg10","DPS10",
"DPS20","Mana","Mana10","Mana20","HP","EffHP","HP10",
"EffHP10","HP20","EffHP20")])),
id.vars='Hero')
plot_ly(data=group.heroes, x=~variable, y=~value, color=~factor(Hero), hoverinfo='all',
type='scatter', mode='lines+markers', colors='Dark2', alpha=0.8) %>%
layout(title=paste('All Heroes comparison, centred and standardized, v7.21'),
xaxis=list(title='Normalized Attributes'), yaxis=list(title='Deviation'))

The process of learning a set of clusters for a given dataset is trivial in R; the more interesting question is determining the optimal number of clusters to be learned. With the goal of learning to play clusters of heroes, we can consider a relevant range of groupings to try. Having a single cluster provides no new information at all, while having too many clusters reduces the value of the abstraction. A range of 2-10 clusters serves as an informal constraint for optimization.
Based on contextual knowledge from the game, seeing five clusters emerge (corresponding to the five informal roles on a team) would be my initial expectation. Another scheme divides heroes into two main roles, as either a ‘support’ (strong early, but falls off later) or a ‘carry’ (generally weak and reliant on supports early, but exponentially stronger as they gain more levels).
The fit of a given set of clusters is measured by the within-cluster sum of squared distances (from each point to its assigned cluster centre) and the between-cluster sum of squared distances (from each cluster centre to the overall centre of the data, weighted by cluster size). An optimal solution will maximize the distinction between clusters and minimize the spread within each cluster, with the fewest centres. Since the within-SS will decrease and the between-SS will increase monotonically as more clusters are permitted, optimization becomes about finding the lowest number of clusters before diminishing returns occur. Visual inspection for this change in first differences is called the ‘elbow’ or ‘knee’ method. The most obvious ‘elbow’ occurs at just two clusters, and it is difficult to suggest any alternatives from visuals alone.
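The relationship between the two measures, and the first differences that the elbow method looks at, can be sketched on any data set. The example below uses the built-in `iris` measurements purely as a stand-in for the hero stats; the variable names are illustrative.

```r
set.seed(1)                     # k-means starts from random centres
x <- scale(iris[, 1:4])         # built-in data set, used only as a stand-in

ks  <- 1:6
fit <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 25))
within  <- sapply(fit, function(f) f$tot.withinss)
between <- sapply(fit, function(f) f$betweenss)
total   <- sapply(fit, function(f) f$totss)

# Within-SS and between-SS always add up to the total SS
all.equal(within + between, total)

# First differences: how much within-SS drops for each added cluster
data.frame(k = ks[-1], drop = -diff(within))
```

Since the total is fixed, plotting either the within-SS or the between-SS carries the same information; the elbow is the point where the `drop` column stops shrinking quickly.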
Taking a more formal approach, the package NbClust offers a large number of optimization criteria for the number of clusters, as outlined by the authors in NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Each criterion is calculated, the suggested optimum is recorded, and the results are tabulated as votes for each solution. Note: The Hubert index and D index are graphical methods that do not contribute to the votes, but visually indicate a solution.
The top two solutions, two and three clusters, will be investigated for meaning and suitability.
(Due to the ever-changing nature of Dota, the votes for each solution also change over time as the game is updated. Six clusters was a consistent second-place through several patches before this document was published and remains the third-place recommendation, and so will be included for consideration.)
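The voting step itself is simple to reproduce. In this sketch, the per-index recommendations are copied by hand from the NbClust output printed in this document. With a live result I believe the same numbers sit in the first row of `votes$Best.nc`, but that is an assumption worth checking against the package documentation.

```r
# Best number of clusters per index, copied from the NbClust output in this document.
# Hubert and Dindex are graphical methods and are reported as 0.
best.nc <- c(KL = 9, CH = 2, Hartigan = 9, CCC = 2, Scott = 7, Marriot = 5,
             TrCovW = 3, TraceW = 9, Friedman = 4, Rubin = 9, Cindex = 2, DB = 9,
             Silhouette = 2, Duda = 2, PseudoT2 = 2, Beale = 2, Ratkowsky = 2, Ball = 3,
             PtBiserial = 3, Frey = 1, McClain = 2, Dunn = 3, Hubert = 0, SDindex = 3,
             Dindex = 0, SDbw = 10)

# Keep only votes inside the searched range of 2-10 clusters, then count them
tally <- sort(table(best.nc[best.nc %in% 2:10]), decreasing = TRUE)
tally
names(tally)[1]   # the majority-rule choice
```

Tabulating the votes this way makes ties easy to spot, such as the equal support for three and nine clusters in this run.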
# select a subset of features that capture the greatest variability, and have minimal collinearity
k.means.data <- scale(derived[c('Range','MvSpd25','Dmg','DPS','DPS20','Arm','Arm20','Mana','Mana20',
'HP','EffHP','HP20','EffHP20')])
# trial for number of clusters
max.k <- 10
within.ss <- sapply(1:max.k, function(k){
kmeans(k.means.data, centers=k, nstart=50, iter.max=50)[c('tot.withinss','betweenss')] })
# consider sum-squared distances within & between clusters, and the average between-SS per cluster
plot(1:max.k, within.ss['tot.withinss',], type='b', ylim=c(0,1500),
main='Visual Inspection - Elbow Method', xlab='# of clusters', ylab='Sum of Squared Distances')
points(1:max.k, within.ss['betweenss',], type='b', col='grey')
points(1:max.k, as.numeric(within.ss['betweenss',])/(1:max.k), type='b', col='blue')
abline(v=2, lty=2)
legend(6,1500, legend=c('Within-SumSq','Between-SumSq','Between-SumSq per cluster'),
col=c('black','grey','blue'), lty=1, bty='n')

library(NbClust)
# calculate criteria for optimal number of clusters,
# and apply voting to rank different solutions
votes <- NbClust(data=k.means.data, distance="euclidean", min.nc=2, max.nc=10, method='kmeans')

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 5 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## KL CH Hartigan CCC Scott Marriot
## 9 2 9 2 7 5
## TrCovW TraceW Friedman Rubin Cindex DB
## 3 9 4 9 2 9
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## 2 2 2 2 2 3
## PtBiserial Frey McClain Dunn Hubert SDindex
## 3 1 2 3 0 3
## Dindex SDbw
## 0 10
# run selected model to fit and classify heroes
k.means.fit.2 <- kmeans(k.means.data, centers=2, nstart=500, iter.max=100)
clusters.2 <- cbind(ClustNo=k.means.fit.2$cluster, derived[,1:31])
# cluster centers can be interpreted as describing the average/representative hero for each cluster
data.frame(Cluster=1:2, round(k.means.fit.2$centers,2))

| Cluster | Range | MvSpd25 | Dmg | DPS | DPS20 | Arm | Arm20 | Mana | Mana20 | HP | EffHP | HP20 | EffHP20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.80 | 0.30 | 0.45 | 0.47 | 0.26 | 0.30 | 0.24 | -0.45 | -0.62 | 0.43 | 0.54 | 0.48 | 0.67 |
| 2 | 0.88 | -0.32 | -0.49 | -0.51 | -0.29 | -0.32 | -0.26 | 0.49 | 0.68 | -0.47 | -0.59 | -0.52 | -0.73 |
# visual comparison of all clusters, normalized variables
group.2 <- melt(data.frame(Hero=derived$Hero, ClustNo=k.means.fit.2$cluster, k.means.data),
id.vars=c('Hero','ClustNo'))
plot_ly(data=group.2, x=~variable, y=~value, color=~factor(ClustNo), text=~Hero, hoverinfo='text+name',
type='scatter', mode='markers', colors='Dark2', alpha=0.4) %>%
layout(title='All Clusters', xaxis=list(title='Normalized Attributes'), yaxis=list(title='Deviation'))

# run selected model to fit and classify heroes
k.means.fit.3 <- kmeans(k.means.data, centers=3, nstart=500, iter.max=100)
clusters.3 <- cbind(ClustNo=k.means.fit.3$cluster, derived[,1:31])
# cluster centers can be interpreted as describing the average/representative hero for each cluster
data.frame(Cluster=1:3, round(k.means.fit.3$centers,2))

| Cluster | Range | MvSpd25 | Dmg | DPS | DPS20 | Arm | Arm20 | Mana | Mana20 | HP | EffHP | HP20 | EffHP20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.97 | -0.34 | -0.49 | -0.50 | -0.27 | -0.37 | -0.28 | 0.47 | 0.72 | -0.56 | -0.69 | -0.56 | -0.79 |
| 2 | -0.88 | -0.02 | 0.75 | 0.51 | -0.21 | -0.09 | -0.40 | -0.30 | -0.49 | 0.84 | 0.58 | 0.96 | 0.58 |
| 3 | -0.52 | 0.72 | -0.23 | 0.17 | 0.89 | 0.89 | 1.24 | -0.47 | -0.67 | -0.24 | 0.45 | -0.43 | 0.66 |
# visual comparison of all clusters, normalized variables
group.3 <- melt(data.frame(Hero=derived$Hero, k.means.data, ClustNo=k.means.fit.3$cluster),
id.vars=c('Hero','ClustNo'))
plot_ly(data=group.3, x=~variable, y=~value, color=~factor(ClustNo), text=~Hero, hoverinfo='text+name',
type='scatter', mode='markers', colors='Dark2', alpha=0.5) %>%
layout(title='All Clusters', xaxis=list(title='Normalized Attributes'), yaxis=list(title='Deviation'))

# run selected model to fit and classify heroes
k.means.fit.6 <- kmeans(k.means.data, centers=6, nstart=500, iter.max=100)
clusters.6 <- cbind(ClustNo=k.means.fit.6$cluster, derived[,1:31])
# cluster centers can be interpreted as describing the average/representative hero for each cluster
data.frame(Cluster=1:6, round(k.means.fit.6$centers,2))

| Cluster | Range | MvSpd25 | Dmg | DPS | DPS20 | Arm | Arm20 | Mana | Mana20 | HP | EffHP | HP20 | EffHP20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.96 | -1.25 | 2.27 | 1.40 | 0.17 | -1.49 | -1.87 | -0.71 | -0.77 | 1.69 | 0.26 | 2.24 | 0.56 |
| 2 | -0.84 | 0.94 | -0.01 | 0.52 | 0.91 | 1.22 | 1.31 | -0.34 | -0.60 | -0.16 | 0.74 | -0.68 | 0.46 |
| 3 | 0.61 | -0.09 | -0.89 | -0.81 | 0.54 | -0.31 | 0.87 | -0.77 | -0.72 | -1.02 | -1.03 | -0.81 | -0.15 |
| 4 | 0.99 | -0.25 | -0.43 | -0.45 | -0.37 | -0.73 | -0.86 | 0.42 | 1.00 | -0.69 | -1.01 | -0.51 | -1.12 |
| 5 | -0.97 | 0.30 | 0.49 | 0.35 | -0.19 | 0.19 | -0.06 | -0.50 | -0.66 | 0.63 | 0.62 | 0.89 | 0.81 |
| 6 | 0.72 | -0.49 | -0.10 | -0.12 | -0.40 | 0.22 | -0.10 | 1.25 | 1.00 | 0.32 | 0.38 | -0.18 | -0.29 |
# visual comparison of all clusters, normalized variables
group.6 <- melt(data.frame(Hero=derived$Hero, k.means.data, ClustNo=k.means.fit.6$cluster),
id.vars=c('Hero','ClustNo'))
plot_ly(data=group.6, x=~variable, y=~value, color=~factor(ClustNo), text=~Hero, hoverinfo='text+name',
type='scatter', mode='markers', colors='Dark2', alpha=0.5) %>%
layout(title='All Clusters', xaxis=list(title='Normalized Attributes'), yaxis=list(title='Deviation'))

At this point, the coding and mathematics are done. Overall, the project was a success, having confirmed that there was a natural way of classifying heroes based purely on stats (2 cluster solution), and further having justified solutions to group heroes in more specific, meaningful ways that facilitate learning about heroes in groups (3 & 6 cluster solutions). This analysis does not even consider the unique abilities that every hero has, yet it seems there is insight to be gained with this limited view of a hero and that this method was sufficient to highlight those patterns.
Two Cluster Classification:
The most significant natural separation of heroes is into two clusters. From the plot, a few factors stick out as having the most distinct separation between clusters: Range, Lv20 Mana, and Lv1 Effective HP show very stark contrasts, and indeed these three factors alone are enough to paint a picture of how a representative hero might play. This describes the archetypal differences between ranged and melee heroes, a distinction that is already made by the game. While this is an expected result that lends credibility to the method, having only two large clusters describing an existing classification does not provide much additional insight. For the purpose of learning, this solution is not especially useful.
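One rough way to check which variables separate the two clusters is to compare the centres directly. The sketch below uses the standardized centres copied from the two-cluster table above. It ignores the spread of heroes within each cluster, so it is a guide to complement the plot rather than a formal test.

```r
# Standardized cluster centres copied from the two-cluster table above
centres <- rbind(
  c(-0.80,  0.30,  0.45,  0.47,  0.26,  0.30,  0.24, -0.45, -0.62,  0.43,  0.54,  0.48,  0.67),
  c( 0.88, -0.32, -0.49, -0.51, -0.29, -0.32, -0.26,  0.49,  0.68, -0.47, -0.59, -0.52, -0.73))
colnames(centres) <- c('Range','MvSpd25','Dmg','DPS','DPS20','Arm','Arm20',
                       'Mana','Mana20','HP','EffHP','HP20','EffHP20')

# Gap between the two centres on each variable, in standard deviations
gap <- sort(abs(centres[1, ] - centres[2, ]), decreasing = TRUE)
round(gap, 2)
```

Range has the widest gap by this measure, consistent with the ranged versus melee reading of the two clusters.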
Three Cluster Classification:
The next most efficient clustering also resembles an existing classification, a hero’s Attribute. The cluster centres describe the archetypal Intelligence, Agility, and Strength heroes, respectively. However, there is an even greater number of ‘misclassifications’ (25/117), where a hero’s naturally assigned cluster does not match its true attribute.
These are not necessarily errors: closer inspection suggests that most of these heroes match the archetype of their cluster more so than their attribute. For example, a melee Intelligence hero with high HP & effective HP is grouped with Strength heroes, rather than the long-ranged and low-HP Intelligence heroes. I believe this solution can be used as a guide to finding these atypical cases, and proposing a more suitable play-style based on their stats, rather than relying on their (potentially misleading) attribute. The value of clustering is that this correction does not require a player to know all the stats of individual heroes.
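Counting such mismatches amounts to cross-tabulating cluster against attribute. The sketch below uses ten made-up labels for illustration only; in the analysis, the inputs would be `k.means.fit.3$cluster` and `derived$Attr`. Labelling each cluster by its most common attribute is my own simplification here, not necessarily how the count above was obtained.

```r
# Toy labels for illustration only
attr.true <- c('Int','Int','Int','Str','Str','Str','Agi','Agi','Agi','Int')
cluster   <- c(  1,    1,    1,    2,    2,    3,    3,    3,    3,    2)

# Rows: clusters, columns: attributes
xt <- table(Cluster = cluster, Attr = attr.true)
xt

# Label each cluster with its most common attribute, then flag heroes that differ
majority <- colnames(xt)[apply(xt, 1, which.max)]
mismatch <- majority[cluster] != attr.true
sum(mismatch)     # number of heroes whose cluster disagrees with their attribute
which(mismatch)   # their positions in the data
```

The flagged positions are the atypical heroes worth a closer look.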
Six Cluster Classification:
Although it has the weakest quantitative justification, this solution provides a different benefit. Having more clusters allows each cluster centre to be more closely fit to the heroes assigned to it, making the centres more representative guides to their group. This allows for more nuanced groupings to develop, and Cluster 6 is especially illustrative, containing only five heroes that deviate dramatically from every other cluster. Their Lv1 Damage and HP are well above average, posing a significant threat at the start of a game. Yet these are all short-range (melee) heroes that move slower than average, and their low armour brings their Effective HP down to roughly average, suggesting that more mobile heroes can evade them and effectively neutralize the threat. Perhaps a cluster 3 hero might be a suitable counter, with average movespeed and above-average range, and with Armour and DPS that far outscale cluster 6 as levels increase through a game. Further, this description seems applicable to all heroes in the cluster, effectively removing the need to generate five different strategies when a single approximation may be good enough.
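One practical caveat when referring to clusters by number: `kmeans` starts from random centres, so the numbering is arbitrary and can change between runs even when the groups themselves do not. Fixing the random seed keeps the labels stable from one run to the next. A minimal sketch, again using the built-in `iris` data as a stand-in for the hero stats:

```r
x <- scale(iris[, 1:4])     # built-in data, used only as a stand-in for the hero stats

set.seed(42)                # fix the random starting centres
fit.a <- kmeans(x, centers = 3, nstart = 50)
set.seed(42)
fit.b <- kmeans(x, centers = 3, nstart = 50)

identical(fit.a$cluster, fit.b$cluster)   # same seed, same labels
table(fit.a$cluster)                      # number of members in each cluster
```

The `table()` call is also a quick way to find small clusters, such as a five-hero group.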
Such insights, patterns that directly yield potential strategies for regular gameplay, were the goal of this project. In that regard, this solution provides the most value in that the groupings are descriptive of divisions that are not already advertised by the game, and the clusters are small enough that strategies based on the cluster centre seem to match quite well to almost every member. For my own purposes of actually learning the game, the six cluster solution was the most valuable guide.